Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 29, 2025

What type of PR is this?

/kind cleanup
/kind api-change

What this PR does / why we need it:

This patch series changes an argument of the NewCloudProvider function to use an AutoscalerOptions struct instead of AutoscalingOptions. This change gives cloud providers more control over the core functionality of the cluster autoscaler.

Specifically, this patch series also adds a method named RegisterScaleDownNodeProcessor to the AutoscalerOptions struct so that cloud providers can inject a custom scale down processor.

Lastly, this change adds a custom scale down processor to the clusterapi provider to help it avoid removing the wrong instance during scale down operations that occur during a cluster upgrade.
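
To make the shape of the change concrete, here is a minimal sketch in Go. Only NewCloudProvider, AutoscalerOptions, AutoscalingOptions, and RegisterScaleDownNodeProcessor are named by this PR; every other identifier and signature below is illustrative, not the actual cluster-autoscaler API:

package sketch

// AutoscalingOptions stands in for the static, flag-derived configuration.
type AutoscalingOptions struct {
	CloudProviderName string
}

// ScaleDownNodeProcessor is an illustrative stand-in for the processor
// interface that filters scale down candidates.
type ScaleDownNodeProcessor interface {
	GetScaleDownCandidates(nodes []string) []string
}

// AutoscalerOptions wraps the static options together with injectable
// behavior, which is why passing it to the builder gives providers more
// control than AutoscalingOptions alone.
type AutoscalerOptions struct {
	AutoscalingOptions      AutoscalingOptions
	scaleDownNodeProcessors []ScaleDownNodeProcessor
}

// RegisterScaleDownNodeProcessor is the hook this PR adds: a provider calls
// it during construction to inject a custom scale down processor.
func (o *AutoscalerOptions) RegisterScaleDownNodeProcessor(p ScaleDownNodeProcessor) {
	o.scaleDownNodeProcessors = append(o.scaleDownNodeProcessors, p)
}

// NewCloudProvider now receives the full options struct instead of only
// AutoscalingOptions.
func NewCloudProvider(opts *AutoscalerOptions) {
	// provider construction elided
}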

Which issue(s) this PR fixes:

Fixes #8494

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note-none, kind/cleanup, kind/api-change, cncf-cla: yes, do-not-merge/needs-area, area/cluster-autoscaler, area/provider/alicloud, and area/provider/aws labels, and removed the do-not-merge/needs-area label, on Sep 29, 2025
@k8s-ci-robot added the area/provider/azure label on Sep 29, 2025
@k8s-ci-robot added the area/provider/cluster-api, area/provider/coreweave, area/provider/digitalocean, area/provider/equinixmetal, size/L, area/provider/externalgrpc, area/provider/gce, area/provider/hetzner, area/provider/huaweicloud, area/provider/ionoscloud, area/provider/kwok, area/provider/linode, area/provider/magnum, area/provider/oci, area/provider/rancher, and area/provider/utho labels on Sep 29, 2025
@elmiko
Contributor Author

elmiko commented Sep 29, 2025

this is an alternative solution to #8531

}

// Check if any of the old MachineSets still have replicas
replicas, found, err := unstructured.NestedInt64(ms.UnstructuredContent(), "spec", "replicas")
Contributor

I think checking spec first, then status, is the least expensive order, but I wanna think out loud and double-check.

When a "new" MachineSet is replacing an "old" MachineSet in the service of a change to the corresponding MachineDeployment, it will first reduce the spec count, and then rely upon the infra provider to validate deletion, and then the MachineSet will apply the final reduction to its status.

If the above is ~true, then I think this is the right way to process through a MachineSet resource representation to definitively determine whether or not the "parent" MachineDeployment resource is in any active state of rolling out.
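
To make that ordering concrete, here is a minimal sketch of the spec-then-status check using the unstructured helpers from the excerpt above; the function name and loop structure are illustrative, not the PR's exact code:

package sketch

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// machineSetHasReplicas reports whether a MachineSet still has desired
// (spec) or observed (status) replicas. Spec is checked first because, per
// the rollout order above, spec.replicas drops before status.replicas does.
func machineSetHasReplicas(ms *unstructured.Unstructured) (bool, error) {
	for _, path := range [][]string{{"spec", "replicas"}, {"status", "replicas"}} {
		replicas, found, err := unstructured.NestedInt64(ms.UnstructuredContent(), path...)
		if err != nil {
			return false, err
		}
		if found && replicas > 0 {
			return true, nil
		}
	}
	return false, nil
}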

Contributor Author

i am less familiar with the upgrade logic inside of cluster-api; i would again defer to @sbueringer here for an answer.

Member

If the above is ~true

It is

}

if len(machineSets) == 0 {
// No MachineSets => MD is not rolling out.
Contributor

What are the specific scenarios that would produce a MachineDeployment with zero corresponding MachineSet resources?

Contributor Author

good question; this code is taken from a patch that @sbueringer developed. i'll let him answer.

Member

The MachineDeployment was created but the MD controller has not created an MS yet.
This should usually only occur for a very short period of time, but I would prefer handling it explicitly vs going through the code below without any MS.

Contributor

ACK, makes sense. Do we really want to tag a MachineDeployment in such a state as a candidate for scale down? I would think we would return true, nil here.

Member

@sbueringer sbueringer Sep 30, 2025

Hm, in general there will be at most Machines of the new MS, if there are any Machines at all. And it would be okay to scale down these Machines.
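
Putting both threads together, the check could look roughly like this; these are illustrative names (reusing machineSetHasReplicas from the sketch above), not the PR's literal code:

// mdIsRollingOut treats a MachineDeployment as rolling out only while one
// of its old MachineSets still reports replicas in spec or status.
func mdIsRollingOut(allMachineSets, oldMachineSets []*unstructured.Unstructured) (bool, error) {
	// No MachineSets at all: the MD controller simply has not created one
	// yet, so nothing old can be mid-drain and scale down remains safe.
	if len(allMachineSets) == 0 {
		return false, nil
	}
	for _, ms := range oldMachineSets {
		hasReplicas, err := machineSetHasReplicas(ms)
		if err != nil {
			return false, err
		}
		if hasReplicas {
			return true, nil
		}
	}
	return false, nil
}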

@elmiko force-pushed the provider-options-refactor branch 3 times, most recently from f60b7f0 to 1afa484, on September 30, 2025 02:49
@elmiko
Contributor Author

elmiko commented Sep 30, 2025

/test pull-cluster-autoscaler-e2e-azure-master

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

the circular import in the unit tests is giving me some problems; not quite sure where it's starting.

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

i am not able to reproduce the unit test failure with the circular import locally. not sure of the best course of action.

/retest

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

/retest-required

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

/retest

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

not sure why the tests aren't triggering.

This change helps to prevent circular dependencies between the core and builder packages as we start to pass the AutoscalerOptions to the cloud provider builder functions.

this changes the options input to the cloud provider builder function so that the full autoscaler options are passed. This is being proposed so that cloud providers will have new options for injecting behavior into the core parts of the autoscaler.

this change adds a method to the AutoscalerOptions struct to allow registering a scale down node processor.

This change adds a custom scale down node processor for cluster api to reject nodes that are undergoing upgrade.
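
Sketching the last commit's idea against the illustrative ScaleDownNodeProcessor interface from the PR description above; the processor type and the nodeIsUndergoingUpgrade helper are hypothetical:

// capiScaleDownNodeProcessor drops nodes whose backing MachineDeployment is
// mid-upgrade, so they are never offered as scale down candidates.
type capiScaleDownNodeProcessor struct{}

func (p *capiScaleDownNodeProcessor) GetScaleDownCandidates(nodes []string) []string {
	candidates := make([]string, 0, len(nodes))
	for _, node := range nodes {
		if nodeIsUndergoingUpgrade(node) { // hypothetical helper, e.g. built on a rollout check
			continue
		}
		candidates = append(candidates, node)
	}
	return candidates
}

A provider builder would then call something like opts.RegisterScaleDownNodeProcessor(&capiScaleDownNodeProcessor{}) during construction.
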
@elmiko force-pushed the provider-options-refactor branch from 1afa484 to a5f07fe on September 30, 2025 19:52
@elmiko
Contributor Author

elmiko commented Sep 30, 2025

rebased on master

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

i'm not sure how i've broken this, but these error lines are not making sense to me:

package k8s.io/autoscaler/cluster-autoscaler/processors/customresources
	imports k8s.io/autoscaler/cluster-autoscaler/cloudprovider/gce from gpu_processor_test.go
	imports k8s.io/autoscaler/cluster-autoscaler/core/options from gce_cloud_provider.go
	imports k8s.io/autoscaler/cluster-autoscaler/processors from autoscaler.go
Error: 	imports k8s.io/autoscaler/cluster-autoscaler/processors/customresources from processors.go: import cycle not allowed in test

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

@towca does this error look familiar to you at all?

this change removes the import from the gce module in favor of using the
string value directly.
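
Reading the error log together with this commit: gpu_processor_test.go (in processors/customresources) imported cloudprovider/gce, gce now imports core/options for the new builder signature, core/options pulls in processors, and processors imports customresources again, which closes the cycle. The fix drops the test's gce import in favor of the raw label string. A sketch of the pattern, where the exact constant and label value are assumptions rather than quotes from the diff:

// before: the test reaches gce only for a label constant, closing the cycle
//   import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider/gce"
//   node.Labels[gce.GPULabel] = "example-gpu-type"
//
// after: use the label string directly and drop the gce import
//   node.Labels["cloud.google.com/gke-accelerator"] = "example-gpu-type"
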
@elmiko
Contributor Author

elmiko commented Sep 30, 2025

ok, i think my last commit on this chain fixed the circular import.

@jackfrancis
Contributor

We have a conversation going on here that will potentially improve the gpu node label concern that this PR encountered:

/lgtm
/approve
/hold

for @towca to give his thumbs up

@k8s-ci-robot added the do-not-merge/hold label on Oct 2, 2025
@k8s-ci-robot added the lgtm label on Oct 2, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Oct 2, 2025
@elmiko
Contributor Author

elmiko commented Oct 3, 2025

We have a conversation going on here that will potentially improve the gpu node label concern that this PR encountered:

i'm fine to change the label in the test for this PR; i was just trying to stop the circular import error.

Development

Successfully merging this pull request may close these issues.

CA ClusterAPI provider can delete wrong node when scale-down occurs during MachineDeployment upgrade